[BUG] Fix source dataset issue when running link jobs #1193
Quick Summary
It looks as if there was a bug in `vertically_concatenate.py` that was causing two `source_dataset` columns to be produced in various steps where `source_dataset` was outlined in the settings object. This was causing issues in DuckDB, where I couldn't use a vertically concatenated dataframe with a `source_dataset` column for a link job -- i.e. a single df with `source_dataset` which outlines which dataset a record belongs to.

But, more troubling than this, it appears that this was just outright breaking link-only jobs in Spark. I'll post some code that breaks below.
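
To give a sense of the shape of the job described above, here is a hedged sketch (not the actual breaking code from this PR). It assumes the Splink 3 DuckDB API (`DuckDBLinker`, `exact_match`), and the data, comparison and blocking rule are placeholders:

```python
import pandas as pd
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import exact_match

# A single, already vertically concatenated dataframe whose source_dataset
# column says which dataset each record came from.
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3, 4],
        "source_dataset": ["df_left", "df_left", "df_right", "df_right"],
        "first_name": ["amy", "amy", "bob", "bob"],
    }
)

settings = {
    "link_type": "link_only",
    "source_dataset_column_name": "source_dataset",
    "comparisons": [exact_match("first_name")],
    "blocking_rules_to_generate_predictions": ["l.first_name = r.first_name"],
}

linker = DuckDBLinker(df, settings)

# Before this fix, the vertical concatenation produced a duplicate
# source_dataset column, so the job fell over here (and the equivalent
# Spark link-only job broke outright).
df_predict = linker.predict()
```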
Essentially though, it was causing this behaviour in `concat_with_tf`, which was causing Spark to throw a wobbly. This code fixes the issue by migrating `_source_dataset_column_name` to the linker class and checking whether the column already exists within the user's database.

Internals and why I opted to go down this path:
To start - these changes don't adjust the underlying SQL/logic that's being used wherever I've replaced `linker._settings_obj_source_dataset_column_name` with `linker._source_dataset_column_name`. The workflow is still: `concat_with_tf`. The change is in the naming convention used and what we output in `predict()`.
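
For illustration only, here is a hypothetical sketch of the kind of linker-level property this describes -- not the PR's actual code, and the class and attribute names are made up. It checks whether the user's input already contains the settings-defined source dataset column and falls back to a prefixed alias if so:

```python
class LinkerSketch:
    # Stand-in for the linker class, purely to illustrate the property.

    def __init__(self, input_columns, settings_source_dataset_col="source_dataset"):
        self._input_columns = list(input_columns)
        self._settings_source_dataset_col = settings_source_dataset_col

    @property
    def _source_dataset_column_name(self):
        # If the user's data already contains the settings-defined column,
        # use an internal alias so the concat step never duplicates it.
        if self._settings_source_dataset_col in self._input_columns:
            return "__splink_source_dataset"
        return self._settings_source_dataset_col
```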
Now if the user provides a dataframe that already contains `source_dataset`, splink will adjust step 1 to use the alias `__splink_source_dataset`. This will then be scrapped when it's no longer needed and the user will be left with their original `source_dataset` in the output. If the user provides a df without `source_dataset`, splink won't fall over and will just call step 1 `source_dataset`.
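
To make that aliasing concrete, here is a small self-contained sketch (not Splink's actual SQL generation; the function and table names are invented) of the behaviour described above:

```python
USER_PROVIDED = "source_dataset"
INTERNAL_ALIAS = "__splink_source_dataset"


def concat_select_sql(input_columns, dataset_name):
    """Build the SELECT for one input table in the vertical concatenation."""
    # Only fall back to the alias when the user's data already has its own
    # source_dataset column; otherwise the plain name is used.
    concat_col = INTERNAL_ALIAS if USER_PROVIDED in input_columns else USER_PROVIDED
    cols = ", ".join(input_columns)
    return f"select '{dataset_name}' as {concat_col}, {cols} from {dataset_name}"


# User-supplied source_dataset column: the internal column is aliased, and the
# alias is dropped again later so the user's original column survives.
print(concat_select_sql(["unique_id", "first_name", "source_dataset"], "df_left"))

# No user-supplied column: the concat step just calls it source_dataset.
print(concat_select_sql(["unique_id", "first_name"], "df_right"))
```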
Why this way?

In the `source_dataset` case -- where the user's data already has a `source_dataset` column -- the link job will still run.

On point 3 - we may want to check the source dataset column in preprocessing and establish if it's valid.
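
If we do add that preprocessing check, it could be something as simple as the sketch below -- purely illustrative, with a made-up function name and messages:

```python
def validate_source_dataset_column(column_names, values, col="source_dataset"):
    """Fail early if a user-supplied source dataset column is unusable."""
    if col not in column_names:
        raise ValueError(
            f"Expected a '{col}' column identifying which dataset each record belongs to."
        )
    if any(v is None for v in values):
        raise ValueError(f"'{col}' contains nulls.")
    if len(set(values)) < 2:
        raise ValueError(f"'{col}' must contain at least two distinct values for a link job.")
```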